Towards Semantically-Rich Spatial Network Representation Learning via Automated Feature Topic Pairing

Automated characterization of spatial data is a critical form of geographical intelligence. As an emerging technique for characterization, Spatial Representation Learning (SRL) uses deep neural networks (DNNs) to learn non-linear embedded features of spatial data. However, SRL extracts features through the internal layers of DNNs, so the learned features lack semantic labels. Texts of spatial entities, on the other hand, provide semantic understanding of latent feature labels, but are not perceived by deep SRL models. How can we teach an SRL model to discover appropriate topic labels in texts and pair learned features with the labels? This paper formulates a new problem, feature-topic pairing, and proposes a novel Particle Swarm Optimization (PSO) based deep learning framework. Specifically, we formulate the feature-topic pairing problem as an automated alignment task between 1) a latent embedding feature space and 2) a textual semantic topic space. We decompose the alignment of the two spaces into: 1) point-wise alignment, denoting the correlation between a topic distribution and an embedding vector; 2) pair-wise alignment, denoting the consistency between a feature-feature similarity matrix and a topic-topic similarity matrix. We design a PSO-based solver to simultaneously select an optimal set of topics and learn corresponding features based on the selected topics. We develop a closed-loop algorithm that iterates between 1) minimizing the losses of representation reconstruction and feature-topic alignment and 2) searching for the best topics. Finally, we present extensive experiments to demonstrate the enhanced performance of our method.


INTRODUCTION
Critical infrastructures (e.g., transportation networks, power networks, social networks, water supply networks) often consist of spatially distributed entities that interact with each other, and have generated massive spatial-networked behavior data. Analyzing such data can identify trends, forecast future behavior, and detect anomalies. To enable effective analysis, it is critical to develop a new capability of automated characterization that effectively extracts feature vectors from spatio-networked data.
As one of the emerging techniques, representation learning can be adapted to learn non-linear embedded features of spatial network data, which we call spatial representation learning (SRL). There has been a rich body of work in SRL, including node embedding, autoencoder, random walk, adversarial learning, and generative learning based methods applied to spatial data (Wang and Li, 2017;Wang et al., 2018a;Wang et al., 2018b;Chandra et al., 2019;Jean et al., 2019;Wang et al., 2019a;Wang et al., 2019b;Zhang Y. et al., 2019;Shan et al., 2020;Wang et al., 2020c;Wang et al., 2020d;Wang et al., 2021). Although these works achieved remarkable success, limited model interpretability still hinders these SRL methods from being applied in scenarios that demand more security, fairness, and rigor.
A lack of model interpretability can cause damaging or controversial consequences in incomplete scenarios that are not well studied (Doshi-Velez and Kim, 2017). For instance, in the autonomous driving scenario, the end-to-end autopilot system brings high safety risks for drivers 1 . In 2015, Google's photo app classified images of black people as gorillas, which exposed the limitation of algorithms 2 . More seriously, widely used crime prediction software tends to assign higher risk scores of future crimes to black defendants 3 . Model interpretability is one of the most important approaches to overcoming these limitations, so how to enhance it has attracted much attention from researchers (Elshawi et al., 2019;Hong et al., 2020;Stiglic et al., 2020;Poursabzi-Sangdeh et al., 2021). However, many existing works suggest that there is a trade-off between model performance and model interpretability (Mori and Uchihira, 2019;Saisubramanian et al., 2020). Whether we can improve model interpretability while maintaining model performance is the research question of this paper.
To relieve the limitations of prior literature and expand the application scenarios of SRL approaches, a novel SRL model should understand not just which features are effective, but also what these effective features stand for. This issue relates to two tasks: 1) deep representation learning; 2) label generation and matching for latent embedded features. In response, we formulate the problem as a task of feature-topic pairing (Figure 1), which is to align a latent embedding feature space, consisting of multiple latent features, and a textual semantic topic space, consisting of multiple topic labels during SRL. The basic idea is to teach a machine to extract topic labels from texts, and then pair the labels with learned features. To that end, we propose to develop a novel deep learning framework to unify feature learning, topic selection, feature-topic matching.
There are three unique challenges (Figure 2) in addressing this problem. 1) Label Generation Challenge. The semantically-rich texts of spatial entities describe their types, functions, and attribute-related information. For instance, on a real estate website, the texts of a residential community describe crime rates and events, great school ratings, nearby transportation facilities, grocery stores, companies, and universities. These texts, if properly analyzed, will help to identify which underlying features truly attract residents to pay more to live there. However, these spatial texts are all unstructured. How can we construct a textual semantic topic space for spatial entities to support feature-topic pairing? 2) Measurement Challenge. We aim to teach a machine to automatically pair embedded features with topic labels in a self-optimizing fashion. As a result, a measurement is needed to quantify the alignment or matching score between the topic label space and the embedding feature space, in order to guide the machine's search. However, there is no standard measurement for quantifying the topic-embedding space alignment. Thus, what form of measurement should be adopted? And how can we integrate a suitable measurement into the whole self-optimizing framework? 3) Optimization Challenge. Since the model needs to decide on an optimized topic label subset, the feature-topic pairing problem involves multiple machine learning tasks, including feature learning, topic label selection, and feature-topic matching. If the three tasks are completed separately, step by step, there is no guarantee that they are globally optimized. So, how can we develop a deep optimization framework that jointly and simultaneously unifies the three tasks?
To solve the three challenges, we develop a new PSO-based framework (named AutoFTP) that encloses the optimizations of feature learning, topic selection, and feature-topic pairing in a loop. Specifically, our contributions are: 1) Formulating the feature-topic pairing problem. Motivated by the lack of feature labels in SRL, we formulate and develop a new problem: feature-topic pairing. In the proposed model, we adopt a new strategy: we first let an optimizer automatically select K topics; the optimizer then guides the representation learner to learn K latent features that optimally align with the K topics. 2) Generating candidate topic labels. We propose a three-step mining method to generate candidate topic labels. Specifically, we first extract keywords from the texts of all spatial entities. Then, we learn keyword embedding feature vectors with a pretrained word model (He, 2014). Finally, we cluster all keyword embeddings by maximizing inter-topic distances and minimizing intra-topic distances to generate topics as candidate feature labels. 3) Quantifying feature-topic alignment. We identify two types of feature-topic alignment: 1) point-wise alignment, and 2) pair-wise alignment. First, the point-wise alignment describes the correlation between an embedding feature vector and a categorical topic distribution. In particular, we maximize the correlation so that the distance between the distribution of the embedding vector space and the distribution of the topic semantic vector space is minimized. The motivation for point-wise alignment is that if a topic's density is high in describing a spatial entity, the topic's corresponding feature value is expected to be large, so that it co-varies with the topic density. In this way, we align the distribution covariance of the two spaces. Second, the pair-wise alignment describes the consistency between a feature-feature similarity matrix and a topic-topic similarity matrix. In particular, we use the feature-feature similarity graph to describe the topology of the latent embedding feature space, and the topic-topic similarity graph to describe the topology of the textual semantic topic space. If the two spaces are aligned, the two graphs (represented by matrices) are similar as well. 4) Optimization in the loop. We develop a Particle Swarm Optimization (PSO)-based algorithm. In this algorithm, we first simultaneously optimize the representation learning loss, point-wise alignment loss, pair-wise alignment loss, and downstream task loss as the feedback for PSO. Guided by the feedback, the PSO-based algorithm selects a better K-sized topic subset for feature-topic pairing. In particular, based on the loss function value, PSO iteratively generates topic masks (i.e., 0-1 indicators to select or deselect) to search for the optimal topics for space pairing until the learning objective converges. In this way, the PSO jointly achieves topic selection, feature-topic pairing, and latent feature learning.
Finally, we evaluate our method using Beijing's urban geography and mobility data. For comparison, we implemented a broad range of other algorithms. Results showed that our method consistently outperformed the competing methods. We also perform an ablation study and analyses of interpretability, robustness, stability, and sensitivity to justify our technical insights.

PRELIMINARIES AND PROBLEM STATEMENT
In this section, we introduce key definitions of AutoFTP and the problem statement.

Particle Swarm Optimization
PSO is a heuristic optimization algorithm that finds an optimal solution in a dynamic environment by imitating the social activity of a flock of birds. Figure 3 shows the origin of PSO. A flock of eagles wants to capture a rabbit. To achieve the goal, all eagles exchange information related to the position of the rabbit. Each eagle updates its position based on its current status, its velocity, the position it knows to be closest to the rabbit, and the position the flock knows to be closest to the rabbit, until the rabbit is captured.
Similarly, solving the feature-topic pairing problem can be viewed as a task of searching for the optimal matching solutions in a dynamic environment. Specifically, we can view the eagles as a set of binary topic selectors, which select the optimized subset of topics from a candidate topic set for feature-topic pairing. The choices of these binary topic selectors are iteratively updated in order to converge to the best-matched topic-feature pairs. During the iterative process, all the binary topic selectors jointly share the changes of the objective function losses (i.e., the losses of representation construction, feature-topic alignment, and the downstream predictive task), so each topic selector knows how to update its topic selection in the next round.
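For reference, the eagle analogy corresponds to the canonical PSO update rule written below. This is the textbook form and not necessarily the exact variant used in AutoFTP. The symbols $\omega$ (inertia weight), $c_1$, $c_2$ (acceleration coefficients), and $\rho_1$, $\rho_2$ (uniform random factors in $[0, 1]$) are standard PSO notation that this paper does not define, so they should be read as assumptions. Here $x_i$ and $v_i$ are the position and velocity of particle $i$.

```latex
\begin{aligned}
v_i^{(t+1)} &= \omega\, v_i^{(t)}
  + c_1 \rho_1 \bigl(\mathrm{pBest}_i - x_i^{(t)}\bigr)
  + c_2 \rho_2 \bigl(\mathrm{gBest} - x_i^{(t)}\bigr), \\
x_i^{(t+1)} &= x_i^{(t)} + v_i^{(t+1)}.
\end{aligned}
```

For binary topic selectors, the position is a 0-1 vector, so a binary PSO variant is needed. A common choice is to pass the velocity through a sigmoid and sample each bit from the resulting probability.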

Definitions
Definition 1: Spatial Entity. A spatial entity is a geographical concept that consists of a range (e.g., a circular area with a radius of 1 mile) and a location (i.e., the latitude and longitude of a center). The spatial entity also includes various Points-of-Interest (POIs) of different categories (e.g., buildings for education, shopping, medical services, banking, etc.).
Definition 2: Point-wise Alignment. To tackle feature-topic pairing, we assume there are 1) an embedding vector that describes the features of a spatial entity and 2) a corresponding topic distribution associated with the spatial entity, which are extracted by optimization. To achieve feature-topic alignment, we propose a point-wise alignment to describe the correlation between features and topics. Figure 4A shows an example of point-wise alignment: we expect to maximize the correlation between the selected topic vector and the spatial embedding vector. The larger the correlation between the two vectors is, the more similar they are.
FIGURE 3 | The origin of PSO: a flock of eagles is preying on a rabbit. To capture the rabbit quickly, each eagle records the closest position to the rabbit found during its own exploration history (pBest). Meanwhile, all eagles share the closest position to the rabbit that the flock knows (gBest). All eagles explore positions based on their velocity, pBest, and gBest until the rabbit is captured.
Definition 3: Pair-wise Alignment. We propose another perspective (pair-wise) to model the feature-topic alignment. For each entity-entity pair, we compute their feature-feature similarity and topic-topic similarity, and obtain: 1) a topic-topic similarity matrix S; 2) a feature-feature similarity matrix S′. We measure the consistency between the two matrices as the pair-wise alignment. Figure 4B shows an example of pair-wise alignment: we aim to make the topic-topic similarity matrix S as close as possible to the feature-feature similarity matrix S′.

The Feature-topic Pairing Problem
The feature-topic pairing problem aims to pair the latent features extracted by representation learning with the explicit topics of the texts of a spatial entity. Formally, given a set of N spatial entities, the n-th entity is described by multiple graphs (e.g., a POI-POI distance graph $G^d_n$ and a POI mobility connectivity graph $G^m_n$, defined in Section 3.3) and a topic distribution $t_n$ extracted from textual descriptions $E_n$. Let $r_n$ be the embedding vector of the n-th entity. The objective is to optimize a function that measures representation loss and feature-topic alignment, where $R = \{r_n\}_{n=1}^{N} \in \mathbb{R}^{N \times K}$ are the embeddings of all spatial entities and K is the number of features in an embedding vector.

THE PROPOSED METHOD-AUTOFTP
In this section, we first introduce an overview of our AutoFTP framework, then present its technical details. Figure 5 shows our proposed framework. First, we construct a semantic topic space by extracting topic distribution from the corresponding texts of spatial entities. Then, we initialize a feature embedding space based on the geographical structures of spatial entities. Next, we utilize a PSO-based topic selector to select the optimal K topics for pairing with the spatial embeddings coming from the feature embedding space. During the pairing process, the losses of spatial representation learner, point-wise alignment, pair-wise alignment, and downstream tasks are regarded as feedback to update the topic selector for the next pairing iteration. With the development of the learning iteration, the feature embedding space aligns to the topic semantic space gradually. Finally, the learned spatial embeddings of AutoFTP are effective and semantically rich. Here, to validate the effectiveness of AutoFTP, we apply the framework to predict the real estate price (downstream tasks) of the residential communities (spatial entities) based on spatial embeddings of the communities. The more accurate the prediction is, the more effective the learned embedding is. In addition, the AutoFTP can be generalized to other spatial representation learning problems with graphs and texts.

Textual Topic Extraction
To derive the textual semantic topic space, we extract the topic distributions of spatial entities from texts generated by location-based social networks. Traditional topic models, such as LDA (Blei et al., 2003) and PLSA (Hofmann, 2013), are based on bag-of-words representations and therefore ignore word order in sentences. To improve the performance of topic modeling, we employ a pretrained deep word embedding model (He, 2014) to generate topics.
FIGURE 5 | An overview of AutoFTP. In the framework, we first construct a topic semantic space based on the texts of spatial entities. Then, we initialize an embedding feature space based on the geographical structures of spatial entities. Later, we employ a PSO-based framework to conduct feature-topic pairing through jointly optimizing representation learning, point-wise alignment, pair-wise alignment, and downstream task over learning iterations.
Frontiers in Big Data | www.frontiersin.org October 2021 | Volume 4 | Article 762899
As illustrated in Figure 6, we first collect the text descriptions of all entities. Next, we extract keywords from the texts using the TextRank algorithm (Mihalcea and Tarau, 2004) and leverage a pre-trained language model (He, 2014) to learn the corresponding word embedding of each keyword. Moreover, we exploit a Gaussian Mixture Model (GMM) to cluster the keyword embeddings into T topics. The clustering model provides a topic label for each keyword. To explain the labeling process, we take the i-th keyword's embedding vector $x_i$ as an example. First, we assume that the T topics obey a Gaussian Mixture Distribution (GMD). Then we randomly initialize the parameters of the GMD. Next, we use the Expectation Maximization (EM) algorithm to find the optimal parameters of the GMD. Finally, we calculate the probability (a.k.a. membership) of $x_i$ belonging to each topic based on the GMD, and select the topic with the highest probability as the label of $x_i$. After that, we construct the topic distribution vector of each spatial entity. In particular, for the n-th entity, the topic vector $t_n$ is a T-dimensional vector, where each dimension corresponds to a topic and is filled with the number of associated keywords.
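The last two steps of the labeling process (hard assignment by highest membership, then counting keywords per topic) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the GMM has already been fitted and that its per-keyword membership probabilities are available. The function names `assign_topics` and `topic_vector` and the numeric values are hypothetical.

```python
from collections import Counter


def assign_topics(memberships):
    """For each keyword, pick the topic with the highest membership probability."""
    return [max(range(len(p)), key=lambda t: p[t]) for p in memberships]


def topic_vector(keyword_topics, num_topics):
    """Count how many keywords of one entity fall into each of the T topics."""
    counts = Counter(keyword_topics)
    return [counts.get(t, 0) for t in range(num_topics)]


# Hypothetical memberships of four keywords of one entity over T = 3 topics,
# in the form a fitted GMM would return them (values invented for illustration).
memberships = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]

labels = assign_topics(memberships)        # one topic label per keyword
t_n = topic_vector(labels, num_topics=3)   # T-dimensional topic vector of the entity
print(labels, t_n)
```

In practice the memberships would come from the EM-fitted mixture over keyword embeddings; only the argmax and counting steps are shown here.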

Graph Extraction of Spatial Entities
In order to learn the embedding feature vectors of spatial entities, we construct the graph-structured topology of each spatial entity. This is because there is inherent spatial autocorrelation between any two spatial entities, according to the first law of geography. We describe a spatial entity in terms of its POIs by building two graphs. 1) POI-POI distance graph: denoted by $G^d$, where POI categories are nodes and the average distances between POI categories are edge weights. 2) POI-POI mobility graph: denoted by $G^m$, where nodes are POI categories and edge weights are human mobility connectivity. The number of POI categories in this paper is M, and the two graphs are extracted via the method in (Wang et al., 2018a). Specifically, we first use a parametric function to estimate the POI visit probability from taxi GPS trace data: $P(\varsigma)$, where $\varsigma$ denotes the distance between a POI and a drop-off position in a taxi trace, $\beta_1 = \max_{\varsigma} P(\varsigma)$, and $\beta_2 = \arg\max_{\varsigma} P(\varsigma)$. We calculate the visited probability of all POIs according to the formula. We then sum the probabilities of POIs belonging to the same POI category to obtain the visited probability of that category. Finally, we calculate the connectivity strength $C_{ij}$ between POI categories i and j from $P_i$ and $P_j$, the visited probabilities of POI category i and POI category j respectively.

Spatial Representation Learner
To learn the representations of spatial entities, we utilize the Graph Auto Encoder (GAE) (Kipf and Welling, 2016) to construct the latent embedding space. Specifically, to learn the embedding feature vector of the n-th entity, the encoder has two GCN layers. In the encoding computation, $A_n$, $I_n$, $\tilde{A}_n$, and $D_n$ share the same shape $\mathbb{R}^{M \times M}$: $A_n$ is the adjacency matrix, $I_n$ is the identity matrix, $\tilde{A}_n$ is the symmetrically normalized adjacency matrix, and $D_n$ is the degree matrix. In addition, $X_n \in \mathbb{R}^{M \times U}$ is the feature matrix of the graph, in which U is the feature dimension; $W^{(1)}_n \in \mathbb{R}^{U \times H}$ is the weight matrix of the first GCN layer, in which H is the output dimension of the layer; $W^{(2)}_n \in \mathbb{R}^{H \times K}$ is the weight matrix of the second GCN layer; $z_n \in \mathbb{R}^{M \times K}$ is the output embedding of the encoder. The decoder recovers the adjacency matrix from $z_n$: $\hat{A}^p_n = \sigma(z_n z_n^{\top})$. The optimization objective is to minimize the reconstruction loss $L_R$ between the original graph, denoted by the adjacency matrix $A_n$, and the reconstructed graph, denoted by the adjacency matrix $\hat{A}^p_n$. We apply the GAE to the POI-POI distance graph $G^d_n$ and the POI-POI mobility graph $G^m_n$ of the n-th spatial entity. After that, we obtain the node representations of $G^d_n$ and $G^m_n$, denoted by $z^d_n \in \mathbb{R}^{M \times K}$ and $z^m_n \in \mathbb{R}^{M \times K}$. Then, we aggregate $z^d_n$ and $z^m_n$ by averaging all node embeddings to obtain the graph embeddings of $G^d_n$ and $G^m_n$ respectively. Finally, we integrate the graph embeddings of $G^d_n$ and $G^m_n$ into the unified spatial embedding of the entity by averaging, denoted by $r_n \in \mathbb{R}^K$.
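The decoder and the two averaging steps described above can be illustrated with a small pure-Python sketch. It assumes the encoder outputs are already available and takes $\sigma$ to be the logistic sigmoid, as in the standard GAE; the GCN layers themselves are omitted. The function names and the toy node embeddings are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_adjacency(z):
    """Inner-product decoder: A_hat[i][j] = sigmoid(z_i . z_j)."""
    return [[sigmoid(sum(a * b for a, b in zip(zi, zj))) for zj in z] for zi in z]


def mean_pool(z):
    """Average the M node embeddings (rows of z) into one K-dimensional graph embedding."""
    m = len(z)
    return [sum(col) / m for col in zip(*z)]


def entity_embedding(z_d, z_m):
    """Average the distance-graph and mobility-graph embeddings into r_n."""
    g_d, g_m = mean_pool(z_d), mean_pool(z_m)
    return [(a + b) / 2.0 for a, b in zip(g_d, g_m)]


# Toy encoder outputs for M = 3 POI categories and K = 2 features (invented values).
z_d = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z_m = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]

a_hat = decode_adjacency(z_d)    # reconstructed M x M adjacency of the distance graph
r_n = entity_embedding(z_d, z_m) # unified K-dimensional spatial embedding
print(r_n)
```

Because the decoder uses an inner product, the reconstructed adjacency is always symmetric, which matches undirected graphs such as the POI-POI distance graph.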

Measuring the Alignment of Embedding and Semantic Spaces
To pair features with topics, we conduct space alignment from the point-wise and pair-wise perspectives. Referring to the definitions in Section 2.2 and Section 2.3, we aim to align the topic semantic space and the feature embedding space in terms of the coordinate system and the information content respectively. During the alignment, we minimize the point-wise alignment loss $L_P$ and the pair-wise alignment loss $L_C$. For convenience, we take the n-th entity as an example to explain the calculation. 1) Point-wise Alignment Loss: $L_P$. We first select K values from the topic vector $t_n$ as the vector $\check{t}_n \in \mathbb{R}^K$, which contains the most representative semantics in the semantic space. Then, we maximize the correlation between $\check{t}_n$ and the spatial embedding $r_n$, which is equivalent to minimizing the negative correlation between the two vectors, $-\mathrm{cov}(\check{t}_n, r_n) / (\delta(\check{t}_n)\,\delta(r_n))$, where cov(.) denotes the covariance and δ(.) denotes the standard deviation.
2) Pair-wise Alignment Loss: $L_C$. We first construct the topic-topic similarity matrix S and the feature-feature similarity matrix S′. Specifically, for $S \in \mathbb{R}^{K \times K}$, we calculate the similarity between any two topics. For $S' \in \mathbb{R}^{K \times K}$, we calculate the similarity between any two features of the spatial embeddings. We keep the pair-wise consistency between S and S′ by minimizing the Frobenius norm of their difference, $\lVert S - S' \rVert_F$.
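As a worked illustration of the point-wise term, the sketch below computes the negative Pearson correlation between a selected topic vector and an embedding for a single entity. It uses the population covariance and standard deviation. Returning 0 for a constant vector is an assumption made here to avoid division by zero, not something the paper specifies. All names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation cov(x, y) / (std(x) * std(y)) of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if std_x == 0.0 or std_y == 0.0:
        return 0.0  # assumption: a constant vector is treated as uncorrelated
    return cov / (std_x * std_y)


def pointwise_loss(topic_vec, embedding):
    """Negative correlation: minimizing it maximizes the feature-topic correlation."""
    return -pearson(topic_vec, embedding)


# K = 4 selected topic counts and two candidate embeddings (invented values).
t_sel = [5.0, 1.0, 3.0, 0.0]
r_aligned = [0.9, 0.2, 0.5, 0.1]   # large where the topic counts are large
r_reversed = [0.1, 0.5, 0.2, 0.9]  # large where the topic counts are small

print(pointwise_loss(t_sel, r_aligned), pointwise_loss(t_sel, r_reversed))
```

The aligned embedding yields a lower (more negative) loss than the reversed one, which is the behavior the point-wise alignment rewards.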
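The pair-wise term can be sketched in the same spirit. The paper does not state which similarity function fills S and S′, so this example assumes cosine similarity between columns of an N×K matrix, where each column collects one topic (or one feature) across N entities. That choice, the function names, and the toy data are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def column_similarity(matrix):
    """K x K similarity matrix between the K columns of an N x K matrix."""
    cols = list(zip(*matrix))
    return [[cosine(ci, cj) for cj in cols] for ci in cols]


def frobenius_distance(s, s_prime):
    """Frobenius norm of S - S'."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(s, s_prime)
                         for a, b in zip(row_a, row_b)))


def pairwise_loss(topics, embeddings):
    return frobenius_distance(column_similarity(topics), column_similarity(embeddings))


# N = 3 entities, K = 3 selected topics (invented values).
topics = [[3.0, 0.0, 1.0], [1.0, 2.0, 0.0], [0.0, 1.0, 4.0]]
scaled = [[0.3 * v for v in row] for row in topics]            # same geometry, rescaled
unrelated = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

print(pairwise_loss(topics, scaled), pairwise_loss(topics, unrelated))
```

With cosine similarity the loss is insensitive to a global rescaling of the embeddings, so it only penalizes differences in how features relate to one another.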

Supervised PSO For Automatic Topic Selection
As introduced above, we select K topics so that the representation learner can learn a K-sized embedding vector in terms of the K topics to achieve feature-topic alignment. However, how can the machine automatically identify the best K and select the most appropriate K topics? A naive idea is to select K topics randomly at each iteration until we have traversed all topic combinations, and then pick the best topic subset based on the objective function. This search, however, is time-consuming and computationally expensive. Moreover, topic selection is a combinatorial optimization problem, which is hard to solve with derivative-based optimization algorithms. Thus, a fast, derivative-free optimization algorithm should be selected as our optimizer. Considering the high time complexity of traversing all possible subsets to find the optimal result, we propose to formulate the joint task of feature learning, topic selection, and topic-feature matching as a PSO problem.
The PSO-based optimization framework is illustrated in Figure 7. Specifically, we first randomly initialize a number of particles in PSO, where a particle is a binary topic mask (i.e., a mask value of 1 indicates "select" and a mask value of 0 indicates "deselect"). In other words, each particle selects a subset of topics. A multi-objective deep learning model, whose objective function includes the losses of graph reconstruction, semantic alignment, and the regression estimator in the downstream task, is trained to learn spatial representations using each selected topic subset. As an application, we use the embeddings of spatial entities (residential communities) to predict their real estate prices, and the loss of the regression model $L_{Reg}$ measures the error between $c_n$, the gold-standard real estate price, and $c^p_n$, the predicted price. Next, we calculate the fitness of each particle from the total loss of the deep model. Then, we use the fitness to inform all particles how far they are from the best solution. Next, each particle moves toward the solution based not only on its current status but also on the movement of all particles. After the fitness value of PSO converges, PSO identifies the best topic subset. Finally, we obtain the semantically-rich embeddings of spatial entities, given by $R = \{r_n\}_{n=1}^{N}$.
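The search loop over binary topic masks can be sketched with a generic sigmoid-transfer binary PSO. This is a simplified stand-in, not the AutoFTP solver: the fitness here is a hypothetical toy function, whereas in the paper it is derived from the total loss of the trained deep model. The sketch also does not enforce that exactly K topics are selected, and the hyperparameter values (`inertia`, `c1`, `c2`, the velocity clamp) are assumptions.

```python
import math
import random


def binary_pso(fitness, num_topics, num_particles=8, iterations=30,
               inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize fitness(mask) over 0/1 topic masks with a sigmoid-transfer binary PSO."""
    rng = random.Random(seed)
    masks = [[rng.randint(0, 1) for _ in range(num_topics)]
             for _ in range(num_particles)]
    velocities = [[0.0] * num_topics for _ in range(num_particles)]
    p_best = [m[:] for m in masks]              # best mask seen by each particle
    p_best_fit = [fitness(m) for m in masks]
    best = min(range(num_particles), key=lambda i: p_best_fit[i])
    g_best, g_best_fit = p_best[best][:], p_best_fit[best]  # best mask seen by the swarm

    for _ in range(iterations):
        for i in range(num_particles):
            for d in range(num_topics):
                pull_own = c1 * rng.random() * (p_best[i][d] - masks[i][d])
                pull_swarm = c2 * rng.random() * (g_best[d] - masks[i][d])
                v = inertia * velocities[i][d] + pull_own + pull_swarm
                v = max(-4.0, min(4.0, v))      # clamp to keep probabilities away from 0/1
                velocities[i][d] = v
                # Sample the bit: probability of "select" is sigmoid(velocity).
                masks[i][d] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-v)) else 0
            fit = fitness(masks[i])
            if fit < p_best_fit[i]:
                p_best[i], p_best_fit[i] = masks[i][:], fit
                if fit < g_best_fit:
                    g_best, g_best_fit = masks[i][:], fit
    return g_best, g_best_fit


# Hypothetical toy fitness: Hamming distance to a hidden "ideal" topic mask.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0]


def toy_fitness(mask):
    return sum(a != b for a, b in zip(mask, TARGET))


best_mask, best_fit = binary_pso(toy_fitness, num_topics=len(TARGET))
print(best_mask, best_fit)
```

In a full pipeline, each fitness evaluation would train or fine-tune the representation learner on the masked topics and return its combined loss, which makes each evaluation far more expensive than in this toy setting.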

EXPERIMENTAL RESULTS
In this section, we present extensive experiments with real-world data to answer the following research questions: Q1. How
In experiments, the residential regions are treated as spatial entities. The texts reflect the urban utilities and characteristics of spatial entities from multiple perspectives, such as traffic condition, economic development, and demographic situation. The real estate prices indicate the average value of the real estate of each spatial entity over 6 months. The POIs are extracted from www.dianping.com, a POI review website in China covering small businesses such as restaurants, banks, gas stations, and shopping markets. Each POI is described in the format <POI id, POI category, latitude, longitude>.

Application: Real Estate Price Prediction
Our proposed method (AutoFTP) can learn a list of vectorized representations for all spatial entities. Therefore, as a downstream application, we can apply these representations to train a regression model to predict the average real estate price of these spatial entities. Specifically, we first apply AutoFTP to learn a series of representations of spatial entities based on their geographical structural information and related text descriptions. Then, we build a deep neural network (DNN) model to predict the average real estate price of each spatial entity from its corresponding representation. For convenience, we take the n-th spatial entity as an example to explain the regression model. The formulation of the DNN is $f(r_n, w) = w \cdot g(r_n) + b$, where $r_n$ is the representation of the n-th spatial entity, $g(r_n)$ is the nonlinear transformation of $r_n$, w is the weight term, and b is the bias term. We want to minimize the difference between the predicted price $f(r_n, w)$ and the real price $y_n$. Thus, the objective of the DNN is $\min \frac{1}{N} \sum_{n=1}^{N} (y_n - f(r_n, w))^2$, where N is the total number of spatial entities.

Evaluation Metrics
We evaluated our method using a real estate price prediction task (Section 4.1.2). We took the feature representation vectors of residential communities as inputs and predicted their real estate prices. We compared the gold-standard prices $y_n$ with the predicted prices $\hat{y}_n$ in terms of four metrics: RMSE, MAE, MAPE, and MSLE. The regression loss and optimization algorithm are controlled to be the same. The lower the four metrics are, the more effective the spatial embedding features are.
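For concreteness, the four metrics can be computed as in the sketch below, using their standard textbook definitions. These definitions are assumed to match the ones used in the paper. MAPE is returned as a fraction rather than a percentage, and MSLE uses log(1 + x), which requires non-negative prices. The function name and sample values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, MAPE (as a fraction), and MSLE for two equal-length lists."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined when a true value is 0; prices are assumed to be positive.
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    msle = sum((math.log1p(p) - math.log1p(t)) ** 2
               for t, p in zip(y_true, y_pred)) / n
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "MSLE": msle}


# Invented gold-standard and predicted prices for three communities.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE and MAE are in price units, whereas MAPE and MSLE are scale-free, so the latter two are less dominated by the most expensive communities.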

Baseline Algorithms
We compared our proposed method with seven widely-used and robust representation learning (embedding) methods as follows: 1) AttentionWalk (Abu-El-Haija et al., 2018) utilizes a novel attention model to automatically learn the hyper-parameters of random-walk based network embedding methods, which improves the flexibility and performance of the model. We set the learning rate as 0.01 and the regularization parameter as 0.5. 2) ProNE (Zhang J. et al., 2019) formulates network embedding as sparse matrix factorization to improve the calculation speed, and conducts the propagation process in the spectrally modulated space to enhance the representation. We adopt the default parameter setting in (Zhang J. et al., 2019). 3) GatNE (Cen et al., 2019) is a random-walk based network embedding method, which considers the information of different attributes of nodes to enhance the graph representation. We set the number of walks as 20, walk length as 10, window size as 5, and patience as 5. 4) GAE (Kipf and Welling, 2016) utilizes GCN to learn the node representations in the encoder-decoder paradigm by minimizing the reconstruction loss. We set the number of GCN layers as 2 and the learning rate as 0.0001. 5) DeepWalk (Perozzi et al., 2014) is an extension of the word2vec model (Mikolov et al., 2013), which brings the idea of truncated random walks to a network embedding scenario. We set the number of walks as 80, walk length as 10, and window size as 5. 6) Node2Vec (Grover and Leskovec, 2016) is an enhanced version of DeepWalk, which considers the homophily and structural equivalence of networks during the embedding process. We set the number of walks as 80, walk length as 10, window size as 5, return parameter p as 0.25, and in-out parameter q as 4. 7) Struc2Vec (Ribeiro et al., 2017) learns the node representation by considering the structural identity of nodes in the network. We set the number of walks as 80 and walk length as 10.
Besides, there are four losses in AutoFTP: reconstruction loss $L_R$, point-wise alignment loss $L_P$, pair-wise alignment loss $L_C$, and regression loss $L_{Reg}$. The four losses provide the optimization direction of AutoFTP. To study the benefits of each part, we develop four internal variants of AutoFTP: 1) AutoFTP$_R$, which only keeps $L_R$ of AutoFTP; 2) AutoFTP$_{(R+P)}$, which keeps $L_R$ and $L_P$ of AutoFTP; 3) AutoFTP$_{(R+C)}$, which keeps $L_R$ and $L_C$ of AutoFTP; 4) AutoFTP$_{(R+P+C)}$, which keeps $L_R$, $L_P$, and $L_C$ of AutoFTP. The dimension of embeddings in all models is 20.

Hyperparameters, Source Code, and Reproducibility
We detail the hyperparameters and the steps of our algorithm in the Appendix. We released our code 4 to help reproduce the experimental results.

Environmental Settings
The experimental studies were conducted on the Ubuntu 18.04.3 LTS operating system, with an Intel(R) Core(TM) i9-9920X CPU@ 3.50GHz, 1 way SLI Titan RTX, and 128GB of RAM, using Python 3.7.4, Tensorflow 2.0.0, and Pyswarm 1.3.0.
Table 2 shows the comparison of all the 11 models. As can be seen, AutoFTP overall outperforms the baseline algorithms in terms of RMSE, MAE, MAPE, and MSLE. A possible reason for this observation is that, compared with the baseline algorithms, AutoFTP not only captures geographical structural information but also preserves the rich semantics of spatial entities. Besides, the regression estimator (the downstream task) of AutoFTP provides a clear learning direction (accuracy) for spatial representation learning. Thus, in the downstream predictive task, the spatial embedding features learned by AutoFTP beat all baselines. Another interesting observation is that, among all baseline models, GatNE outperforms the others in terms of all evaluation metrics. A likely explanation is that GatNE considers different attribute information of nodes in the spatial graphs of spatial entities, so the spatial embedding features it learns are more effective than those of the other baseline models. Moreover, Table 2 shows that the predictive performance of GAE is better than that of most random-walk based approaches, except GatNE. This observation indicates that graph convolution-based methods (GAE, AutoFTP) are more suitable than random-walk based approaches (the other baselines) for modeling geographical structure information. In summary, the overall performance experiment shows the superiority and effectiveness of AutoFTP compared with the baseline models.

Study of AutoFTP Variants (Q2)
To validate the necessity of each loss of AutoFTP, we compared AutoFTP with its internal variants. Table 2 shows that the predictive accuracies rank as follows: AutoFTP > AutoFTP_(R+P+C) > AutoFTP_(R+P) > AutoFTP_(R+C) > AutoFTP_R. A potential interpretation is that, as more optimization objectives (losses) are added, AutoFTP captures more characteristics of spatial entities from representation learning, point-wise alignment, pair-wise alignment, and the regression task. In addition, AutoFTP_(R+P) predicts better than AutoFTP_(R+C). A plausible reason is that the features captured by point-wise alignment are more indicative of spatial entities than those learned by pair-wise alignment. Moreover, AutoFTP outperforms the other variants by a large margin. This indicates that the regression loss L_Reg provides a clear optimization direction for AutoFTP and preserves the features related to the downstream task in the spatial embeddings. To sum up, the ablation study demonstrates that all four loss functions of AutoFTP are necessary for capturing the representative features of spatial entities during spatial representation learning.

Study of the Interpretability of Spatial Embeddings (Q3)
The space alignment in AutoFTP is implemented from two perspectives: point-wise alignment and pair-wise alignment. Together, the two kinds of alignment give the learned spatial embeddings more semantic meaning and make them more interpretable.

Study of the Point-wise Alignment
To analyze the point-wise alignment, we picked communities (spatial entities) 497, 1043, 1126, and 1232 as examples and plotted their embedding vectors against their topic vectors. We also extracted the names of the 6 most significant topics. Figure 8 shows that AutoFTP keeps the point-wise consistency between the semantic feature space and the embedding space. Moreover, the learned spatial embeddings carry abundant semantic meaning, so we can infer the urban functions of each community from Figure 8. For instance, community #497 exhibits high weights on specific topics such as functional facilities, general education, and construction materials. This indicates that the community is probably a large residential area with well-decorated apartments and general education institutions. Communities #1043 and #1126 both have high weights on entertainment, higher education, parks, etc. We can speculate that both are residential regions near universities, because the facilities belonging to these topics indicate that the two communities are very likely to be in a college town. Community #1232 exhibits high weights on district, entertainment, and convenience related categories. We can infer that this community is a commercial district with many transportation facilities.
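One simple way to quantify how closely an embedding vector follows its topic vector is the Pearson correlation between the two. The sketch below is an illustration under that assumption; it is not necessarily the exact point-wise loss used by AutoFTP, and the two vectors are hypothetical values for a single community.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / (norm_u * norm_v)


# Hypothetical 6-dimensional vectors for one community.
embedding = [0.9, 0.1, 0.7, 0.2, 0.8, 0.1]
topic_weights = [0.8, 0.2, 0.6, 0.1, 0.9, 0.2]
print(round(pearson(embedding, topic_weights), 3))
```

A value close to 1 means the embedding dimensions rise and fall with the paired topic weights, which is the pattern the plots in Figure 8 are meant to show.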

Study of the Pair-wise Alignment
To observe the pair-wise alignment, we visualized the pair-wise topic similarity matrix and the pair-wise feature similarity matrix as heat maps. As illustrated in Figure 9, the two matrices are similar, with only minor differences. This indicates that the embedding feature space is well matched with the semantic feature space.
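The comparison of the two matrices can also be expressed numerically. The sketch below assumes cosine similarity between columns (one column per feature or topic) and measures the gap with the Frobenius norm; both choices are assumptions made for illustration, and the data are synthetic.

```python
import numpy as np


def cosine_similarity_matrix(columns):
    """Cosine similarity between every pair of columns of a 2-D array."""
    norms = np.linalg.norm(columns, axis=0, keepdims=True)
    unit = columns / np.clip(norms, 1e-12, None)  # guard against zero columns
    return unit.T @ unit


def pairwise_gap(embeddings, topics):
    """Frobenius norm of the difference between the two K x K similarity matrices."""
    feature_sim = cosine_similarity_matrix(embeddings)
    topic_sim = cosine_similarity_matrix(topics)
    return float(np.linalg.norm(feature_sim - topic_sim))


rng = np.random.default_rng(0)
topics = rng.random((50, 4))  # hypothetical: 50 entities x 4 topics
embeddings = topics + 0.01 * rng.standard_normal((50, 4))  # nearly aligned features
print(pairwise_gap(embeddings, topics))
```

A gap near zero corresponds to heat maps that look nearly identical, as described for Figure 9.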

Study of the Interpretability
The results of Sections 4.4.1 and 4.4.2 show that the feature embedding space and the topic semantic space are well aligned. To study the interpretability of spatial embeddings further, we built a tree model for real estate price prediction and then analyzed the feature importance based on the semantic labels of the spatial embeddings. Specifically, we used a random forest model to predict the real estate price of spatial entities from their embeddings. Then, we collected the feature importance of the model, as illustrated in Figure 10. The semantic labels of the top 5 embedding dimensions that affect real estate price prediction are "Entertainment", "Transportation", "Security", "Education", and "Business". The three most representative keywords of each semantic label are shown in Table 3. Intuitively, these 5 semantic labels are the most important factors that people consider when buying an estate (Boiko et al., 2020). In other words, they affect the real estate price heavily. Thus, the results of the feature importance analysis are reasonable. In summary, this experiment validates that AutoFTP can automatically select the most significant topic semantics for feature-topic pairing. In addition, the semantic labels of the spatial embeddings can serve as auxiliary information that improves the interpretability of the embeddings.
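The idea of ranking labelled embedding dimensions by their influence on a price predictor can be sketched as follows. To stay dependency-light, this sketch swaps in a least-squares linear model and permutation importance in place of the random forest used in the study, so it illustrates the workflow rather than reproducing the paper's method. The data and weights are synthetic, and only the label names come from the text above.

```python
import numpy as np

# Label names are from the study; everything else here is synthetic.
LABELS = ["Entertainment", "Transportation", "Security", "Education", "Business"]


def fit_linear(X, y):
    """Least-squares linear model with an intercept; returns the coefficients."""
    design = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply the fitted linear model to a feature matrix."""
    return np.column_stack([X, np.ones(len(X))]) @ coef


def permutation_importance(coef, X, y, rng):
    """Increase in mean squared error when one column is shuffled."""
    base = np.mean((predict(coef, X) - y) ** 2)
    scores = []
    for j in range(X.shape[1]):
        shuffled = X.copy()
        shuffled[:, j] = rng.permutation(shuffled[:, j])
        scores.append(np.mean((predict(coef, shuffled) - y) ** 2) - base)
    return np.array(scores)


rng = np.random.default_rng(0)
X = rng.random((200, len(LABELS)))  # synthetic labelled embeddings
weights = np.array([5.0, 4.0, 3.0, 2.0, 1.0])  # synthetic ground truth
y = X @ weights + 0.01 * rng.standard_normal(200)

scores = permutation_importance(fit_linear(X, y), X, y, rng)
ranking = [LABELS[j] for j in np.argsort(scores)[::-1]]
print(ranking)
```

Because every embedding dimension carries a semantic label, the ranking reads directly as a list of named factors instead of anonymous latent dimensions.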

Robustness Check (Q4)
To evaluate the robustness of AutoFTP, we divided the embeddings into 5 groups (HaiDian, ChongWen, FengTai, ShiJingShan, FangShan) according to the geographical district of the spatial entities. Figure 11 shows that AutoFTP consistently outperforms the baselines and performs more stably than the baselines across the five districts. This indicates that AutoFTP captures the unique local features of different spatial groups. There are two possible reasons for the observation: 1) the semantic alignment of AutoFTP injects the distinct semantic characteristics of spatial entities into the learned embeddings; and 2) the customized regression estimator provides a clear optimization objective for AutoFTP. Overall, the robustness check demonstrates that AutoFTP outperforms the baseline models not only over the global zone but also in each local sub-area.
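A per-district evaluation of this kind amounts to grouping the test samples by district and computing the metric within each group. The sketch below shows this for RMSE; the helper name `rmse_by_group` and the sample values are hypothetical.

```python
import math
from collections import defaultdict


def rmse_by_group(groups, y_true, y_pred):
    """Return {group: RMSE} computed over the samples of each group."""
    squared_errors = defaultdict(list)
    for g, t, p in zip(groups, y_true, y_pred):
        squared_errors[g].append((t - p) ** 2)
    return {g: math.sqrt(sum(errs) / len(errs)) for g, errs in squared_errors.items()}


# Hypothetical samples; the district names are three of the five groups above.
districts = ["HaiDian", "HaiDian", "FengTai", "FengTai", "FangShan"]
prices = [10.0, 12.0, 8.0, 9.0, 5.0]
preds = [11.0, 11.0, 8.0, 11.0, 5.0]
print(rmse_by_group(districts, prices, preds))
```

Comparing the spread of the per-group values across models gives a simple numerical view of the stability that Figure 11 shows graphically.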

Study of the Stability and Sensitivity (Q5)
In this section, we evaluated the stability and parameter sensitivity of AutoFTP. We first examined the stability of AutoFTP by analyzing its training losses and the convergence of the PSO optimization part. To observe the trend of each loss objectively, we scaled the loss values into [0, 1] and visualized them in Figure 12A. All losses (reconstruction loss L_R, regression loss L_Reg, point-wise loss L_P, and pair-wise loss L_C) converge over the training iterations. In particular, L_R and L_Reg reach equilibrium after only 10 epochs. This observation validates the training stability of AutoFTP. We also analyzed the convergence of PSO. As shown in Figure 12B, the PSO optimization part converges after 65 epochs, which further indicates the stable performance of AutoFTP. For the parameter sensitivity evaluation, we investigated the influence of the parameter K (the dimension of the final embeddings and the number of significant topics) on the model performance and the training time. As in Figure 12A, we scaled the values of all metrics into [0, 1] and visualized them in Figure 12C. The value of K affects the model performance heavily. This is reasonable because K determines the information content of the final learned embeddings. The plots in Figure 12D show that the larger K is, the longer the training time is. A potential reason is that a larger K means that more topic subsets have to be tried for feature-topic pairing.
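Scaling curves into [0, 1] so that losses of very different magnitudes can share one plot is typically done with min-max normalization. The sketch below assumes that is the scaling used here; the loss values are hypothetical.

```python
def min_max_scale(values):
    """Linearly rescale a sequence to [0, 1]; a constant sequence maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical loss curves on very different scales.
curves = {
    "L_R": [120.0, 60.0, 30.0, 30.0],
    "L_Reg": [0.8, 0.4, 0.2, 0.2],
}
scaled = {name: min_max_scale(curve) for name, curve in curves.items()}
print(scaled)
```

After scaling, each curve starts near 1 and flattens near 0 once it converges, so convergence speed can be compared across losses regardless of their original units.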

RELATED WORK
Graph Representation Learning with Latent Semantics. Graph representation learning refers to techniques that preserve the structural information of a graph in a low-dimensional vector (Wang et al., 2016; Abu-El-Haija et al., 2018; Zhang J. et al., 2019; Cen et al., 2019; Wang et al., 2020b). However, because traditional graph representation learning models are implemented with deep neural networks, the learned embeddings lack interpretability. Recently, to overcome this limitation, researchers have leveraged the texts related to graphs to learn semantically rich representations. For instance, Mai et al. implemented an entity-retrieval academic search engine that incorporates text embedding and knowledge graph embedding to accelerate retrieval (Mai et al., 2018). Xiao et al. improved the semantic meaning of knowledge graph embeddings by integrating both graph triplets and textual descriptions of spatial entities (Xiao et al., 2017). Different from these studies, in this paper, based on spatial entity data composed of spatial graphs and related texts, we propose a new representation learning framework that unifies feature embedding learning and feature-topic pairing in a closed-loop manner through a PSO based optimization method.
Topic Models in Spatio-temporal Domain. Topic models aim to automatically cluster words and expression patterns for characterizing documents (Xun et al., 2017; Lee and Kang, 2018; Hu et al., 2019). Recently, to understand the hidden semantics of spatial entities, many researchers have applied topic models in the spatio-temporal data mining domain (Zheng et al., 2017; Huang et al., 2019; Huang et al., 2020). For instance, Zhao et al. automatically discovered representative and interpretable human activity patterns from transit data with a spatio-temporal topic model. Yao et al. tracked the spatio-temporal and semantic dynamics of urban geo-topics based on an improved dynamic topic model that embeds spatial factors of pairwise distances between tweets (Yao and Wang, 2020). These successful applications validate the effectiveness of topic models for extracting semantics in spatio-temporal domains. However, traditional topic models focus only on word frequency in texts and neglect the semantics of words. Recently, the success of many pre-trained language models (Vaswani et al., 2017; Kenton and Toutanova, 2019; Yang et al., 2019) has brought hope for producing more reasonable topic distributions. Thus, in this paper, we employ a pre-trained language model to obtain the embeddings of keywords and use a Gaussian Mixture Model to extract topic distributions based on these embeddings.
Explainable Artificial Intelligence (XAI). As artificial intelligence methods are successfully applied in more and more scenarios, improving model explainability has become a major challenge. In traditional machine learning, researchers employ simple models that are naturally explainable, such as linear models, decision trees, and rule-based models, to explain the modeling process (Burkart and Huber, 2021; Lakkaraju et al., 2016; Lakkaraju …). Lundberg et al. (2020) improved the global interpretability of tree models by combining many local feature explanations of each prediction, and obtained good performance on three medical machine learning problems by applying these models. Wang and Rudin (2015) provided a Bayesian framework for learning falling rule lists that does not rely on traditional greedy decision tree learning approaches, in order to improve the explainability of classification models. Although these approaches can improve model interpretability, they often sacrifice model performance. Recently, the excellent predictive performance of deep learning models has led to their application in many scenarios, such as fraud detection, credit evaluation, and healthcare. However, explainability remains the key limitation of deep learning models. To improve it, XAI for deep learning has attracted much attention from researchers (Gunning, 2017; Selvaraju et al., 2017; Samek and Müller, 2019; Agarwal et al., 2020). For instance, Selvaraju et al. (2017) proposed a gradient-weighted class activation mapping method to highlight the regions of an image that are important for predicting a concept. Agarwal et al. (2020) proposed neural additive models, which learn a linear combination of neural networks to depict the complex relationships between input features and the output. However, these models focus on the relationship between the embeddings and the outputs and cannot provide explicit semantic meanings. Different from these studies, we try to give explicit semantic labels to the learned embeddings through the alignment between the feature embedding space and the topic semantic space.
FIGURE 11 | Robustness check according to geographical district.
Comparison with Prior Literature. As an emerging feature extraction technique, deep SRL has demonstrated its power in automated geographic and spatial feature extraction. However, SRL inherits the drawbacks of traditional DNNs; in particular, the embedding feature space lacks semantic interpretation. Texts can provide more interpretation, but spatial text mining has developed separately. There is now growing interest in both fields in benefiting from each other's advances. Our study targets an unexplored area at the intersection of representation learning on geospatial data and topic label mining in texts. We formulate a new problem, feature-topic pairing, to address the challenge of aligning the feature embedding space with the semantic topic space. The self-optimizing solution unifies representation learning, topic label selection, and feature-topic matching in a PSO framework. This framework can be generalized to other integrated tasks in various application scenarios, such as representation learning integrated not just with topic based selection but also with causal selection or other constrained selection over features. This is how this study differs from and advances prior literature.

CONCLUSION
We presented a novel spatial representation learning (SRL) framework, namely AutoFTP. The spatial embeddings produced by traditional SRL models lack semantic meaning. To overcome this limitation, we formulated the feature-topic pairing problem. We proposed a novel deep learning framework to unify representation learning, topic label selection, and feature-topic pairing. Specifically, we designed a segmentation-embedding-clustering method to generate candidate feature topic labels from texts. We developed an integrated measurement of the point-wise and pair-wise alignment between the topic label space and the embedding feature space. We devised a PSO based optimization algorithm to effectively solve the joint task of feature learning and feature-topic pairing. Our method integrated spatial graphs and associated texts to learn effective embedding features with visible labels. Extensive experiments demonstrated the effectiveness of AutoFTP by comparing it with baseline models. The topic labels of the learned features were illustrated through several case studies and through the feature importance analysis of a downstream task. For future work, we plan to extend our approach from geospatial networks to other applications that consist of graphs and texts, such as social media and software code safety.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. These data can be found here: https://www.dropbox.com/sh/woqh4qvuzq1788r/AAB5Vz1DSeJiLKxq-POHLMAVa?dl=0.

AUTHOR CONTRIBUTIONS
DW proposed the main idea, conducted the major experiments, and wrote the paper. KL helped with part of the experiments and wrote some paragraphs of the paper. DM helped improve the presentation of the paper. PW helped correct typos and errors in the paper. C-TL improved the presentation and language of the paper. YF improved the presentation of the paper and provided the experimental data and devices.